[ci] split verify into 3 parallel jobs + aggregator (~7m -> ~2:45m) by trilamsr · Pull Request #72 · TraceCoreAI/tracecore

trilamsr · 2026-05-19T05:21:05Z

Summary

The verify CI job currently runs 16 steps sequentially — 427s wall time on PR #70 (the most recent merged PR). Split into three parallel jobs grouped by the longest single step, with an aggregator named verify that preserves the existing branch-protection required check.

Partition

Job	Steps	Est. wall time
`verify-test`	`coverage-check` (race tests)	~125s
`verify-lint`	`vet`, `lint`	~155s
`verify-static`	`license-check`, `generate-check`, `build-tags`, `tidy-check`, `register-lint`, `nccl-fr RCE`, `actionlint`, install+`zizmor`, `ci-fuzz-nccl-fr`, `govulncheck`, `doc-check`, `build`	~140s
`verify` (aggregator)	`needs: [verify-test, verify-lint, verify-static]`; `echo success`	~5s

Wall time: max(125, 155, 140) ≈ 155s vs 427s today — saves ~4.5 minutes per PR.

Why this shape

coverage-check is the longest single step (~125s) — kept on its own job so it bounds the lower wall-time edge.
vet and lint share golangci-lint setup; pairing them keeps that overhead amortized.
verify-static is a grab-bag bounded by build (~55s) + fuzz (~40s). If it grows past the lint pole, rebalance later.

Branch protection

Unchanged. The required check verify continues to exist as the aggregator; sub-job failures propagate via needs:. No gh api calls or admin steps required.

Cost

Runner-minute cost rises from 1x to ~3.3x current (each job pays setup-go; Go module cache amortizes the rest). For a repo at this PR cadence the wall-clock win pays for itself in operator time.

Local invariant

make ci is unchanged — still sequential developer convenience. CI just spreads the same Make targets across 3 jobs.

Test plan

CI runs on this PR — verify-test, verify-lint, verify-static all green; verify aggregator green.
Wall time on this PR's CI run is observably lower than PR [chore] pr-review-loop: sync PR title and body in phase 5 #70 (~7m → ~2:45m target).
Aggregator failure mode: not exercised in this PR (would require an intentional break) — verified by inspection of needs: semantics.
make actionlint and make zizmor both clean locally on the diff.

The verify job ran 16 steps sequentially for 427s wall time on PR #70. Split into three parallel jobs grouped by the longest pole: - verify-test: coverage-check (race tests) ~125s - verify-lint: vet + lint ~155s - verify-static: license/generate/build-tags/tidy/register-lint/ nccl-fr RCE/actionlint/zizmor/fuzz/govulncheck/ doc-check/build ~140s A final aggregator job `verify` declares `needs: [verify-test, verify-lint, verify-static]` and echoes success. Branch protection already requires `verify`; keeping the name means zero protection churn while the three sub-jobs surface as informational rows. Wall time drops from ~7m to ~2:45m (bounded by verify-lint). Runner-minute cost is ~3.3x current — each job pays setup-go cache. `make ci` is unchanged — still sequential developer convenience. Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds one sentence to the partition comment block so a contributor adding a new lint/test/scanner gate knows the default bucket (verify-static) and the rebalance trigger (when it pushes verify-static past the verify-lint pole). Surfaced during review of the parallelization change. Signed-off-by: Tri Lam <trilamsr@gmail.com>

## Summary Two root-cause fixes plus two compressions to the `pr-review-loop` skill. **Drops the ralph-loop plugin dependency.** The old path passed the full ~17 KB prompt body through `/ralph-loop:ralph-loop` as `$ARGUMENTS`, which Claude Code's permission classifier aborted with "parser timeout, resource limit, or over-length" on real-world invocation (caught when I tried to invoke the skill on PR #72). The documented manual-paste fallback hits the same classifier limit and was itself a dead end. The skill now orchestrates the 5 phases directly in the current session via `Agent` tool dispatches; no plugin invocation. **Adds a diff-size + path precondition.** Under 200 changed lines AND all paths in `*.md` / `.github/**` / `docs/**` / `.claude/**` auto-offers a single-subagent review instead of paying for ~14 subagent dispatches and 5 review-pass commits. The skill's own "When to run / Skip" guidance already named these as the wrong fit; the gate now enforces it. Operator opts in explicitly to keep the heavy loop. **Cuts.** The two-form `M<X>` / `(no-milestone)` substitution rule (resolved in the precondition step now, not in the prompt body). The enumerated MEMORY.md slug list (slugs are already in context via auto-memory). The reviewer-brief's motivational prose. The duplicate stopping-rule block (merged with acceptance into one canonical checklist). **Keeps.** All 5 phase definitions (Phase 1 self → Phase 2 stakeholder-lens parallel → Phase 3 adversarial → Phase 4 A+ aspiration → Phase 5 simplify + PR title/body sync). Validation cycle. TDD discipline. Reviewer-brief output format. Rubric evolution flow. Final-report format. Readiness-audit gate before completion promise. Net diff: -131 lines (-347/+216), 670 → 538 line file. The bigger change is structural: the skill is now self-contained instead of delegating to a third-party plugin. ## Test plan - [ ] Skim the precondition block — `#4` correctly auto-routes small CI/docs/`.claude/` PRs to a single-subagent path and only continues to 5 phases on explicit operator confirmation. - [ ] Skim "How the loop runs" — direct orchestration is clear, no leftover ralph-loop invocation language. - [ ] Spot-check that all 5 phase definitions are intact and reference the same `<reviewer-brief>`. - [ ] Confirm the merged acceptance/stopping-rule block still gates the completion promise on every prior criterion (PR title sync, AI-vocab clean, `make ci` green, `gh pr checks` green, DCO trailers, no `main` commits). - [ ] `make doc-check` clean on this branch (banned-phrase lint, link resolution, no growth in `(unverified)` count). - [ ] On the next `/pr-review-loop` invocation against a real PR, verify: (a) milestone auto-resolves from the branch name; (b) small/low-risk diff triggers the single-subagent prompt; (c) explicit override starts the 5 phases without trying to invoke `/ralph-loop:ralph-loop`. ## Out of scope Two items from the optimization sketch were not in this PR's ask and remain as follow-ups: dropping the rubric-evolution machinery if it isn't load-bearing in practice, and a lightweight sibling skill (`/pr-review`) for the everyday case. Signed-off-by: Tri Lam <trilamsr@gmail.com>

## Summary PR #72's verify split introduced a silent branch-protection bypass. The aggregator job uses default `needs:` semantics, so any sub-job failure (`verify-test`, `verify-lint`, or `verify-static`) results in the aggregator being SKIPPED. GitHub treats SKIPPED required checks as satisfied, which means any failing sub-job allowed the PR to merge as if all checks passed. Caught on PR #73, which merged at 05:58:35Z despite `verify-test` failing at 05:57:19Z. The required check `verify` showed as SKIPPED rather than FAILURE, and protection treated the SKIPPED state as a pass. ## Fix The aggregator now runs `if: always()` (so it runs even when sub-jobs fail) and explicitly inspects each `needs.<sub-job>.result`, exiting non-zero when any sub-job did not return `success`. Each non-success result emits a `::error::` annotation naming the offending sub-job for operator debuggability. ```yaml verify: needs: [verify-test, verify-lint, verify-static] if: always() steps: - run: | fail=0 if [ "${{ needs.verify-test.result }}" != "success" ]; then ... ... [ "$fail" -eq 1 ] && exit 1 echo "all verify-* gates passed" ``` ## Why it matters Without this fix, the entire branch protection on `main` is effectively trivial — any flaky sub-job silently lets a merge through. The hole was open for the window between PR #72 (merged ~05:36Z) and this fix. ## Test plan - [ ] `make actionlint` clean ✅ - [ ] `make zizmor` clean ✅ (no findings, suppressed count went up because `${{ needs.*.result }}` is correctly recognized as a safe template expression). - [ ] CI on this PR exercises the new aggregator path: all three sub-jobs run, aggregator inspects results. If all pass, aggregator passes. If any fails, aggregator will FAIL (the previous behavior would have skipped here). - [ ] After merge, the next PR that hits a sub-job failure (e.g., a flaky test) will correctly block on `verify` reporting FAILURE. ## Out of scope The `TestReceiver_SLIBudget` flake observed on PR #73 (`p99 539ms > 500ms target` under race-on runner contention) is unrelated to this fix and worth a separate follow-up — either widen the budget under race or skip the SLI assertion in race mode. Signed-off-by: Tri Lam <trilamsr@gmail.com>

## Summary After PR #72's verify split (wall 427s → 242s actual, vs 155s projected), investigated whether setup-go cache sharing could close the gap. It can't: all three `verify-*` jobs already share the same cache key and hit it. The gap is `verify-test` bounded at 235s by `make coverage-check` (race-on `go test ./...` across 50+ packages). Records two follow-ups in `docs/FOLLOWUPS.md` so future work doesn't re-chase setup-go: 1. `coverage-check` sharding across multiple jobs — real wall-time win (~150s target) but a real refactor (split test execution + coverage aggregation). Triggered by PR throughput becoming binding. 2. Move `build-tags` and `build` from verify-static into verify-lint — saves ~17% runner-minutes via dedupe of compile work, zero wall-time. Cheap, but no speed benefit; triggered by runner-minute pressure. ## Test plan - [x] `make doc-check` clean (banned-phrase, link-resolution, `(unverified)` baseline). - [ ] FOLLOWUPS-only PR; no code or workflow changes. ```release-notes No user-facing change. Records a CI wall-time investigation under docs/FOLLOWUPS.md so future agents don't re-chase setup-go cache sharing as a speedup lever. ``` Signed-off-by: Tri Lam <trilamsr@gmail.com>

## Summary Adds two load-bearing lessons to `AGENTS.md` from this session's CI work. Both prevent a future contributor from repeating the same trap. **Aggregator bypass.** GitHub Actions short-circuits an aggregator job's `needs:` to SKIPPED on any sub-job failure, and treats SKIPPED required checks as satisfied. PR #73 silently merged past a failed `verify-test` because the aggregator from PR #72's verify split was SKIPPED rather than FAILURE. The fix shape (`if: always()` + `needs.*.result` check) shipped in PR #74; this lesson documents the trap and the fix so anyone splitting CI jobs in the future doesn't repeat it. **Perf-budget regex flake class.** `require.Regexp` with implicit upper bounds (e.g. `0\.0[0-9]+`) on values whose only invariant is `>0` flake on slow CI runners. Two of these hit in one session: `TestReceiver_SLIBudget` (emit-latency, observed 539ms) and `TestReceiver_SetDegraded` (degraded-seconds, observed 0.126s). The fix shape is the same in both — relax to any positive value (`\d+\.[0-9]*[1-9]`) or use baseline-relative comparisons. File goes from 128 to 148 lines (cap is 150, with 2 lines of remaining headroom — next addition should consider demoting an older entry to a topic note per the file's own promotion rule). ## Test plan - [x] `wc -l AGENTS.md` reports 148, under the 150-line cap. - [x] `make doc-check` clean (banned-phrase lint, 250 links resolve, `(unverified)` count = 7 baseline). - [x] Capture-flow format check (`learn-from-mistakes` skill): banned vocabulary absent, no first-person AI phrasing, no AI attribution, both entries carry `Anchor:` citations. - [ ] CI on this PR exercises the same gates plus the aggregator that's now itself an anchor of lesson #1. ```release-notes NONE — documentation only. Adds two load-bearing lessons to `AGENTS.md` covering GitHub Actions aggregator semantics and a recurring perf-budget regex flake class. No runtime behavior change. ``` Signed-off-by: Tri Lam <trilamsr@gmail.com>

## Summary Three observations from the recent session that didn't fit the structured surfaces (already used for the load-bearing AGENTS.md entries in PR #81 and the agent-internal notes in PR #82). Each captured via the `learn-from-mistakes` flow and lands in its existing topic note. **`.claude/notes/automation.md`** — *Memory captures rationale; hooks enforce.* The pre-PR checklist personal memory landed mid-session was followed by a lint failure shipped to CI within the hour. The same gap closed reliably by the `PreToolUse` hook installed shortly after. For any "always do X before Y" pattern, prefer the hook; the memory documents *why* the hook exists. **`docs/notes/ci.md`** — *Frame CI / perf projections as ranges, not single numbers.* PR #72's 155s wall-time projection vs 242s actual cost an investigation round (later landed in PR #77) because the projection's setup-go-cache amortization assumption was unverified. Either verify assumptions empirically before publishing the number, or frame as a range. **`.claude/notes/review-patterns.md`** — *Self-rate work, then write criteria for the next grade up.* Forces articulation of measurable improvements rather than free-form "anything else?". PR #76's B+ → A → A+ came from two iterations of this exact pattern; each iteration closed real structural gaps. A fourth lesson — "fix existing tools before proposing new ones" — was captured to personal memory (no PR, lives in `~/.claude/projects/.../memory/`), not the repo, because it's a judgment heuristic about my own decision-making rather than a repo-resident convention. ## Test plan - [x] `make doc-check` clean (banned-phrase lint, link resolution, all gates). - [x] `learn-from-mistakes` format check: banned vocabulary absent, no first-person AI phrasing, no AI attribution, all three entries carry `Anchor:` lines pointing at concrete PRs. - [ ] CI on this PR exercises `doc-check` + `pr-lint`. ## Rollback Each entry is a self-contained `### title` + body + `Anchor:` block at the top of its file. No dependents elsewhere; reverting is a single Edit per file. ```release-notes NONE — documentation only. Three meta-lessons from a recent session retrospective land in their existing topic notes (`automation.md`, `ci.md`, `review-patterns.md`). No runtime behavior change. ``` Signed-off-by: Tri Lam <trilamsr@gmail.com>

trilamsr added 2 commits May 18, 2026 22:20

trilamsr enabled auto-merge (squash) May 19, 2026 05:32

trilamsr merged commit 7375cef into main May 19, 2026
8 checks passed

trilamsr deleted the ci/split-verify-3way branch May 19, 2026 05:36

trilamsr mentioned this pull request May 19, 2026

[chore] pr-review-loop: orchestrate directly + add diff-size gate #73

Merged

6 tasks

trilamsr mentioned this pull request May 19, 2026

[ci] verify aggregator: fail on sub-job failure, not skip #74

Merged

4 tasks

trilamsr mentioned this pull request May 19, 2026

[docs] FOLLOWUPS: record CI wall-time investigation post-PR #72 #77

Merged

2 tasks

trilamsr mentioned this pull request May 19, 2026

[docs] AGENTS.md: aggregator bypass + perf-regex flake lessons #81

Merged

4 tasks

trilamsr mentioned this pull request May 19, 2026

[docs] session retrospective: three meta-lessons captured #84

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ci] split verify into 3 parallel jobs + aggregator (~7m -> ~2:45m)#72

[ci] split verify into 3 parallel jobs + aggregator (~7m -> ~2:45m)#72
trilamsr merged 2 commits into
mainfrom
ci/split-verify-3way

trilamsr commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented May 19, 2026

Summary

Partition

Why this shape

Branch protection

Cost

Local invariant

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant